People sometimes think of transformers as specifying joint probability distributions over token sequences up to a given context length. However, I think this is sometimes not the best abstraction. From another POV, a transformer takes as input a sequence of embedding vectors, each of dimension d_model, and maps it to a sequence of output logit vectors, each of dimension equal to the vocabulary size.
This framing is important in that it emphasizes the actual implementation of the transformer. It helps avoid possible missteps, like thinking of transformers as “trying to model text” or of various forms of finetuning as basically “conditioning” the model, and it also accounts for the success of e.g. soft prompts, which don’t correspond to any token sequence.
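A minimal PyTorch sketch of this view (hypothetical sizes; positional encodings and causal masking omitted for brevity): the transformer body is just a function on d_model-dimensional vectors, token IDs only enter through the embedding lookup, and a soft prompt is a set of learned vectors in embedding space that need not equal any row of the embedding matrix.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, d_model, n_ctx = 50_000, 768, 128

embed = nn.Embedding(vocab_size, d_model)      # token IDs -> embedding vectors
body = nn.TransformerEncoder(                  # stand-in for the transformer stack
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True),
    num_layers=2,
)
unembed = nn.Linear(d_model, vocab_size)       # embedding vectors -> logits

def forward_from_embeddings(x: torch.Tensor) -> torch.Tensor:
    """Map (batch, seq, d_model) embedding vectors to (batch, seq, vocab_size) logits."""
    return unembed(body(x))

# Ordinary use: a token sequence is first turned into embedding vectors.
tokens = torch.randint(vocab_size, (1, n_ctx))
logits_from_tokens = forward_from_embeddings(embed(tokens))

# Soft prompt: arbitrary learned vectors in embedding space, prepended to the
# token embeddings. They don't correspond to any token sequence, yet the model
# processes them exactly like token embeddings.
soft_prompt = nn.Parameter(torch.randn(1, 8, d_model))
logits_with_soft_prompt = forward_from_embeddings(
    torch.cat([soft_prompt, embed(tokens)], dim=1)
)
```

The point of the sketch is that nothing in `forward_from_embeddings` knows about tokens at all: the "distribution over token sequences" picture only appears once you restrict the inputs to rows of the embedding matrix.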